Socio-demographic predictors of not having private dental insurance coverage: machine-learning algorithms may help identify the disadvantaged

Background For accessing dental care in Canada, approximately 62% of the population has employment-based insurance, 6% have some publicly funded coverage, and 32% have to pay out-of-pocket. Those with no insurance or public coverage find dental care more unaffordable compared to those with private insurance. To support the development of more comprehensive publicly funded dental care programs, it is important to understand the socio-demographic attributes of all those who find dental care unaffordable. Methods This study is a secondary analysis of the data collected from Ontarians during the latest available cycle of the Canadian Community Health Survey (2017-18), a cross-sectional survey that collects information on health status, health care utilization, and health determinants for the Canadian population. First, bivariate analysis was conducted to determine the characteristics of Ontarians who lack dental insurance. Afterwards, we employed machine learning (ML) to analyze the data and identify risk indicators for not having private dental insurance. Specifically, we trained several supervised ML models and used Shapley additive explanations (SHAP) to determine the relative feature importance for not having private dental insurance from the best ML model, gradient boosting (GBM). Results Approximately one-third of Ontarians do not have private insurance coverage for dental care. Individuals with an income below $20,000, those unemployed or working part-time, seniors aged above 70, and those unable to afford their own housing are more at risk of not having private dental insurance, leading to financial barriers in accessing dental care. Conclusion In the future, government-funded programs can incorporate these identified risk indicators when determining eligible populations for publicly funded dental programs.
Understanding these attributes is critical for developing targeted and effective interventions, ensuring equitable access to dental care for Canadians. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-024-18868-1.


Introduction
Canadians cover their dental expenses in three main ways: (1) 62% through private dental insurance (benefits from an employer or self-purchased); (2) 32% through out-of-pocket payments (paid at the point of care); and (3) 6% through publicly funded targeted dental programs (federal/provincial/municipal). These public programs primarily target children from low-income families, low-income seniors, eligible Indigenous individuals, people on social assistance, and those with disabilities [1,2]. With many Canadians financing their own dental care through out-of-pocket expenses, cost becomes the predominant factor limiting access to care [3,4]. In addition, those covered through public programs generally have limited coverage, which restricts their treatment choices. Consequently, the current dental care financing system inequitably impacts those who need care the most, an example of the 'inverse care law' that has been observed among Canadian populations in accessing dental care [5]. Previous studies revealed that insurance and income are the strongest predictors of reporting cost barriers to receiving adequate dental care in Canada [6][7][8][9]: those with low income and no insurance reported more cost barriers to accessing dental care than their counterparts. Moreover, the method of payment for dental care affects its affordability, with high-income families primarily relying on private insurance and low-income households mainly paying out-of-pocket [10]. Therefore, the lack of affordable dental care and its negative implications could, in part, be addressed through interventions aimed at these two determinants.
Partly in response to this need, the federal government announced plans in 2022 to establish a Canadian Dental Care Plan (CDCP) for low- and middle-income Canadians [11]. The plan would provide coverage for uninsured Canadians with a household income of less than $90,000 a year, with no co-pays for those earning under $70,000. By the end of 2023, the program will start covering those who are under 18 years old, persons with disabilities, and seniors, with full implementation by the end of 2025 to cover all members of financially eligible families. This plan will be funded with an investment of $13 billion over five years, starting in 2023-24, and $4.4 billion ongoing for implementation [12]. It is expected to support up to nine million uninsured Canadians once fully implemented. This initiative, which attempts to address both low income and lack of insurance, would be the most significant health care initiative since Canadian Medicare was established [11].
Previous Canadian studies have categorized insurance status as follows: employment-based insurance, self-purchased insurance, government-based insurance, and no insurance. However, the upcoming Canadian Dental Care Plan simplifies this classification into two categories: insured individuals, who have any form of private insurance (employment-based or self-purchased), and uninsured individuals, who do not have access to any form of private insurance but may be eligible for public insurance. Given this, it is important to know the socio-demographic attributes of those who do not have private dental insurance coverage. Understanding these risk attributes would give policymakers an evidence-informed scientific basis for assessing the eligible population for the upcoming CDCP. Hence, this paper analyzes the Canadian Community Health Survey (CCHS) 2017-18 data using machine learning to identify the strongest predictors of not having private dental insurance.

Materials and methods
Our study is a secondary data analysis of a cross-sectional national survey that covers 97% of the Canadian population, the CCHS (cycle: 2017-18, the latest cycle with oral health data of interest available). The CCHS gathers data on health status, healthcare utilization, and health determinants for the Canadian population. It targets individuals aged 12 and above residing in private dwellings across all 13 Canadian provinces and territories. However, it excludes individuals living on reserves, Indigenous settlements within provinces, full-time Canadian Forces members, and those in institutionalized settings from its sampling frame. The oral health and dental care questionnaires constitute optional content in the survey. This content was tailored to meet specific provincial-level requirements, leading to the inclusion of optional questions in select provinces during each survey cycle, with variations in their content. In the most recent cycle (2017-18), both oral health and dental care data were gathered for Ontario. For more details on the design and sampling features of the CCHS, please refer to the user guide [13]. The Public Use Microdata Files (PUMF) for the 2017-18 CCHS data were accessed online using the Survey Documentation and Analysis (SDA) online tool available through the University of Toronto library at the Computing in the Humanities and Social Sciences (CHASS) portal. No ethics review was sought for the study, as this was a secondary data analysis of anonymized data that contained no personal identifiers, nor was it linked to any other data source [14].

Study populations and variables (or features)
This study utilized CCHS data encompassing all health regions in Ontario (East, West, Central, North, and Toronto), with a total sample size of 19,799. The dataset includes variables related to socio-demographics, oral health, and general and mental health (Table S1). The dependent variable is "Type of Dental Insurance", which we recoded as a binary variable: "having private dental insurance" (combining employment-based and self-purchased private dental insurance) and "not having private dental insurance" (combining government-based dental insurance and no dental insurance). This coding serves the study goal of identifying the population characteristics associated with not having private dental insurance. Individuals with government-based insurance and those without any insurance were grouped together as "not having private dental insurance" because of the specific socio-economic context in Ontario, the only province represented in these data. The low-income cut-off for eligibility for public programs in Ontario is very low; for example, for Healthy Smiles Ontario it is a family income of $28,560 for a family with two children. Many low- and middle-income Ontarians do not have private insurance because they work in contract, casual, or part-time positions, yet they do not qualify for public programs because they do not meet the income threshold. This has been recognized, and the upcoming Canadian Dental Care Plan therefore sets income eligibility at an annual family net income of $90,000 or less. In addition, people who have public insurance or no insurance qualify for this program, but those who have private insurance do not [11,15].
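The binary recoding described above can be sketched in a few lines of pandas. This is only an illustration: the column name `insurance_type` and the four category labels are hypothetical stand-ins, since the actual CCHS variable names and codes are not given in the text, and the toy rows are not survey records.

```python
import pandas as pd

# Hypothetical category labels; the real CCHS codes will differ.
PRIVATE = {"employment-based", "self-purchased"}
NOT_PRIVATE = {"government-based", "none"}


def recode_private_insurance(insurance_type: pd.Series) -> pd.Series:
    """Return 1 for 'not having private dental insurance', 0 for 'having', NA otherwise."""
    recoded = pd.Series(pd.NA, index=insurance_type.index, dtype="Int64")
    recoded[insurance_type.isin(PRIVATE)] = 0
    recoded[insurance_type.isin(NOT_PRIVATE)] = 1
    return recoded


# Toy rows for illustration only; these are not survey records.
toy = pd.DataFrame(
    {"insurance_type": ["employment-based", "none", "government-based", "self-purchased", None]}
)
toy["no_private_insurance"] = recode_private_insurance(toy["insurance_type"])
```

Coding the "not having private dental insurance" group as 1 makes it the positive class, so that recall later measures how many people in that group a model finds. Respondents with a missing insurance type stay missing rather than being assigned to either group.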

Data pre-processing, and feature engineering
Table S1 presents the list of independent variables related to socio-demographics, oral health, and general and mental health, together with the dependent variable, before and after pre-processing and feature engineering. Sixteen socio-demographic, 12 oral health, 4 general or mental health, and one dental insurance variable were used in the subsequent analysis, as indicated in the last column of Table S1.

Machine learning: feature selection, modelling, analysis and evaluation
To prepare the dataset for analysis, categorical variables were transformed into dummy variables, a process also known as one-hot encoding, which resulted in a cleaned dataset (dimensions: 11,877 × 53 variables). Variance Inflation Factor (VIF) analysis was conducted as part of feature selection to detect multicollinearity among the independent variables, where a VIF ≥ 5 indicates potentially problematic multicollinearity and a VIF ≥ 10 indicates extreme multicollinearity (Figure S1) [16]. The total dataset was split so that 80% (n = 15,839) was allocated to training with 10-fold cross-validation and 20% (n = 3960) was allocated to testing. The dependent variable to predict is "type of dental insurance", which was coded as binary: "having private dental insurance" (combining both employment-based and private dental insurance) and "not having private dental insurance". Before building ML models, it is critical to assess whether the training dataset exhibits class imbalance between the categories of the dependent variable. Class imbalance can significantly affect the dependability, equity, and efficacy of ML models, potentially leading to misclassification of the minority class due to a bias towards the majority class. In our dataset, we observed class imbalance between "having private dental insurance" (n = 12,710) and "not having private dental insurance" (n = 7089). To treat class imbalance, we implemented the following commonly used data-level resampling techniques: over-sampling (increasing the number of instances in the minority class, not having private dental insurance), under-sampling (decreasing the number of instances in the majority class, having private dental insurance), and the Synthetic Minority Over-sampling Technique (SMOTE, which generates synthetic samples for the minority class, not having private dental insurance) [17]. This resulted in a total of four datasets (three from the resampling techniques and the original dataset).
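The pre-processing steps in this paragraph (one-hot encoding, VIF screening, an 80/20 split, and random over- and under-sampling) can be sketched as follows. The data, column names, and category labels are hypothetical toy values, not CCHS records. Applying the resampling to the training split only is an assumption of this sketch, and SMOTE is omitted because it requires the separate imbalanced-learn package.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy data for illustration only; column names and values are hypothetical.
n = 500
raw = pd.DataFrame({
    "income": rng.choice(["<20k", "20-59k", "60k+"], size=n),
    "employment": rng.choice(["full-time", "part-time", "unemployed"], size=n),
    "dwelling": rng.choice(["owned", "rented"], size=n),
})
y = pd.Series(rng.choice([0, 1], size=n, p=[0.64, 0.36]), name="no_private_insurance")

# One-hot encode; drop_first avoids perfectly collinear dummy columns.
X = pd.get_dummies(raw, drop_first=True).astype(float)


def vif(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on all other columns."""
    scores = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        scores[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(scores)


vif_scores = vif(X)

# 80/20 split, stratified so both parts keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

train = X_train.assign(target=y_train)
majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Random over-sampling: draw minority rows with replacement up to the majority size.
oversampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=0),
])

# Random under-sampling: draw majority rows without replacement down to the minority size.
undersampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=0),
    minority,
])
```

Columns whose entry in `vif_scores` reaches 5 or 10 would be the candidates for review under the thresholds quoted above. Leaving the test split untouched keeps its class ratio representative of the original data.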
Python v3.9.17 and the scikit-learn 1.4.1 package were used to build supervised ML models drawn from commonly used classifiers: logistic regression (LR), penalized LR with the least absolute shrinkage and selection operator (LR-LASSO), LR with RIDGE (LR-RIDGE), decision tree (DTREE), random forest (RF), adaptive boosting (ADB), bootstrap aggregating (BAG), gradient boosting (GBM), and extreme gradient boosting (XGBT v1.7.3). Because of the class imbalance and the consequences of overlooking false negatives, the recall (sensitivity) metric was chosen to identify the best ML classifier on each training dataset. The best ML classifier for each training dataset was then examined further and its hyper-parameters were tuned (hyper-parameter tuning, or optimization, is the process of selecting the hyper-parameters of an ML algorithm that maximize its performance on a given dataset). The models' performance was evaluated by assessing accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUROC) [18]. Finally, the Shapley Additive Explanations (SHAP, v0.44.1) package was used to interpret the impact of the top discriminatory variables on the model with the highest relative performance. SHAP uses cooperative game theory to calculate the marginal contribution of each feature and examines the influence of each feature on the model prediction [19]. Scikit-learn v0.24.2, a popular Python library for data science and machine learning tasks, was used.
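A minimal sketch of the selection procedure described above, ranking candidate classifiers by cross-validated recall and then tuning the winner, might look like this in scikit-learn. The data are synthetic, only three of the nine classifiers are shown, and the hyper-parameter grids are hypothetical; the grids actually searched in the study are not reported in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (roughly 64/36 class split); not the CCHS data.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.64, 0.36], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "GBM": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 10-fold cross-validated recall on the training split for each candidate.
cv_recall = {
    name: cross_val_score(model, X_train, y_train, cv=10, scoring="recall").mean()
    for name, model in candidates.items()
}
best_name = max(cv_recall, key=cv_recall.get)

# Hypothetical grids, kept small so the example runs quickly.
grids = {
    "LR": {"C": [0.1, 1.0, 10.0]},
    "RF": {"max_depth": [3, None]},
    "GBM": {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
}

# Tune the selected classifier, again optimizing recall, then score it on held-out data.
search = GridSearchCV(candidates[best_name], grids[best_name], cv=10, scoring="recall")
search.fit(X_train, y_train)
test_recall = recall_score(y_test, search.predict(X_test))
```

Passing `scoring="recall"` to both steps keeps model selection and tuning aligned with the stated priority of limiting false negatives. The held-out test split is used only once, for the final score.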

Statistical analysis
All categorical variables were presented as numbers and percentages. Statistical significance was calculated using Pearson's Chi-square test. The R package Arsenal v3.2.7 was used to prepare large-scale statistical summaries, i.e., Table 1.
Additionally, more than 50% of those who were unemployed, lived in rented accommodation, or perceived their oral or general health as "fair" to "poor" had no private dental insurance. Nine ML algorithms were deployed and evaluated for their ability to distinguish between having private dental insurance and not having private dental insurance, using the original dataset and three additional datasets generated through resampling techniques (Figure S2). The highest performing ML classifiers, as determined by the recall (sensitivity) metric across the four training datasets, were as follows: Random Forest with the original dataset (recall: 53.88%), Random Forest with the SMOTE dataset (recall: 84.07%), Random Forest with the over-sampling dataset (recall: 89.31%), and Gradient Boosting with the under-sampling dataset (recall: 74.30%). These models were further optimized by selecting optimal hyper-parameters to maximize each model's performance on a given dataset.

Baseline characteristics of the study populations
Model evaluation metrics of these ML models were measured to identify the most effective model based on performance (Table 2). The GBM model, trained on the under-sampling dataset, demonstrated superior performance in reducing false negatives, as evidenced by its high training recall of 0.7702 and test recall of 0.7522.
Additionally, the model's training accuracy of 0.7708 and test accuracy of 0.7513 indicate a well-balanced model with no significant over-fitting or under-fitting issues. An important observation is that the LR, LR-LASSO, and LR-RIDGE models trained on the three resampled datasets performed well, ranking next after GBM. None of the models performed well on the original dataset, which we attribute to the inherent class imbalance.
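For readers less familiar with these metrics, the sketch below shows how accuracy, precision, recall, and F1 follow from confusion-matrix counts, and how a train-test gap can be read as a rough over-fitting check. The confusion-matrix counts are hypothetical and chosen only for easy arithmetic. The recall and accuracy values are the GBM figures quoted above.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    The positive class is 'not having private dental insurance'.
    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # share of true positives found; low recall means many false negatives
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only; not taken from Table 2.
example = classification_metrics(tp=75, fp=25, fn=25, tn=75)

# Train-test gaps for the GBM model, using the values reported in the text.
recall_gap = 0.7702 - 0.7522
accuracy_gap = 0.7708 - 0.7513
```

Both gaps are about two percentage points, which is the pattern the text describes as showing no significant over-fitting. A large positive gap would instead suggest that the model had memorized the training data.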
Feature importance (Risk attributes). Shapley additive explanations (SHAP) was used to determine the relative feature importance for having private dental insurance and not having private dental insurance from the best ML model (i.e., GBM). The SHAP summary plot in Fig. 1 combines feature importance with the feature effects on the model. The figure presents feature importance scores that indicate the relative importance of each attribute in predicting the risk of not having dental insurance. The blue-red coloured bar presents the impact of each feature value on the model. The attributes predicting higher risk are marked in red, while those predicting lower risk are marked in blue. According to the model, the attributes associated with not having private dental insurance include: 'household income < 20,000', 'avoided dental professional due to cost within last 12 months', 'unemployment status', 'part-time employment', 'patient's age-group >= 70-79', and 'dwelling rented'.
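To make the idea of a feature's marginal contribution concrete, the sketch below computes exact Shapley values for one instance of a small toy model by enumerating every coalition of features. It is a simplified illustration, not the algorithm the SHAP package applies to tree models: features outside a coalition are simply set to a baseline value, and the toy model, its coefficients, and the three feature meanings are all hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are set to their baseline value. This is a
    simplification for illustration; it is not the SHAP tree algorithm.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)

    def value(coalition):
        z = np.array(baseline, dtype=float)
        idx = np.array(coalition, dtype=int)
        z[idx] = x[idx]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def toy_model(z):
    """Hypothetical risk score for [low income, unemployed, renting]."""
    score = -1.0 + 1.5 * z[0] + 1.0 * z[1] + 0.5 * z[2] + 0.5 * z[0] * z[1]
    return 1.0 / (1.0 + np.exp(-score))


x = [1.0, 1.0, 0.0]          # low income, unemployed, not renting
baseline = [0.0, 0.0, 0.0]   # reference respondent with none of the attributes
phi = shapley_values(toy_model, x, baseline)
```

The contributions in `phi` add up to the difference between the prediction for `x` and the prediction for the baseline, and a feature that equals its baseline value (here, renting) receives a contribution of zero. Exhaustive enumeration grows exponentially with the number of features, so it is practical only for toy examples.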

Discussion
Using data from the latest available cycle of the CCHS (2017-18), this study identified population attributes associated with not having private dental insurance in Ontario. As the Canadian health care system excludes dental care, except for surgical dental procedures performed in hospital settings, Canada faces inequitable utilization of the oral health care system [20]. It is well established that low income and lack of insurance coverage play a crucial role in limiting people's ability to access oral health care [7,8,21]. Additionally, the literature has pointed out that although income and insurance are positively correlated, insurance has an independent effect on dental care utilization [22]. Regardless of income level, insured individuals are more likely to utilize and have better access to dental services than their uninsured counterparts [8,10].
Previous studies have reported that poor access to dental care has a negative impact on individuals' health, the health care system, and society [23][24][25][26][27]. Therefore, with 36% of the population lacking private dental insurance (including employment-based and self-purchased plans), it is important to identify the attributes of this disadvantaged population. This would provide policymakers with data-driven evidence on who should be included in the upcoming CDCP. By prioritizing the most vulnerable groups, existing oral health inequities can be expected to be addressed, thereby achieving better oral health for all.
In our study, we employed machine learning to analyze data and identify predictors for not having private dental insurance. Machine learning offers a robust and systematic approach to extracting patterns, relationships, and insights from complex datasets. Based on our findings, the GBM algorithm in combination with under-sampling proved to be the most effective method for identifying predictors of not having private dental insurance, with the strongest performance metrics among the models compared (Table 2). This suggests not only effective handling of class imbalance but also good generalization to unseen data without over-fitting or under-fitting. The evaluation of machine learning models on the original and three resampled datasets highlights the importance of choosing the right data pre-processing technique in conjunction with the model type.
In our ML model, the most significant variable for not having dental insurance was cost barriers to dental care, followed by unemployment. It is crucial to acknowledge the limitations of this cross-sectional study: we cannot definitively determine whether cost barriers to dental care are a risk indicator for not having dental insurance or whether it is the other way around, with not having dental insurance being a risk indicator for avoiding dental visits due to financial barriers. Nevertheless, our results confirm a strong correlation between these two attributes. In Canada, employment-based plans constitute the majority of private dental insurance [28]. Employers offer non-wage benefits such as insurance plans to enhance the employer-employee contract; this offer is voluntary, not obligatory [29]. Unemployment signifies the absence of workplace benefits, or the inability to self-purchase plans, resulting in the lack of private dental insurance. Additionally, part-time employment was a predictor of reporting no dental insurance. Part-time employment is defined as working less than 30 hours per week, typically at lower hourly rates than full-time employment [30,31]. Employers tend to offer dental benefits to their permanent full-time employees more often than to those who work on contract or part-time [32].
The third strongest predictor for having no insurance was being 70-79 years old; this might be explained by the fact that retired people lose their work benefits, particularly the employment-sponsored insurance. Additionally, retirees often rely on a limited fixed pension, restricting their ability to purchase private plans [33].
The fourth and fifth strongest predictors in our model are low income (less than $20,000/year) and renting a house. Studies indicate that some low-wage workers remain uninsured, even when eligible for employer-based coverage, as they prioritize more pay in lieu of health benefits to cover other expenses [34,35]. Furthermore, lack of home ownership reflects housing insecurity that may affect the affordability of purchasing private dental insurance [36].
Our study has some methodological and study design limitations. It is a secondary data analysis of a national survey, thereby precluding the detection or correction of data entry errors from the original survey. Additionally, the CCHS is a cross-sectional survey, so only associations can be assessed and no causal relationships can be inferred from this study. Further, the CCHS excluded individuals living on reserves and other Indigenous settlements in the provinces as well as the institutionalized population, potentially leading to underestimated findings and limited generalizability. Finally, potential measurement errors might have been introduced by respondent recall errors, inconsistency of their opinions, and the respondents' tendency to provide socially desirable answers. That said, this is a general limitation of such surveys and of studies based on them. At the same time, using such large population-based data gives strength to the study. A substantial sample size enabled us to make population-level estimations in Ontario. Moreover, the study employs machine learning to offer data-driven evidence to policymakers regarding vulnerable groups that need to be included in the upcoming national dental care plan.

Conclusions
Approximately one-third of Ontarians do not have private insurance coverage for dental care. People with an annual income of less than $20,000, those who are unemployed or employed part-time, seniors above 70 years of age, and those who are not able to afford their own house are more at risk of not having dental insurance and thereby face cost barriers to accessing dental care. Future government-funded programs need to take these attributes into consideration when deciding the target populations eligible for publicly funded dental programs, to ultimately address existing inequities in the Canadian oral health care system. Knowledge of these attributes can also be helpful for other Organization for Economic Co-operation and Development countries when assessing eligible populations for publicly funded dental care programs. This study also underscores the complexity of model selection and the impact of data pre-processing techniques on machine learning performance. It highlights the need for careful consideration of model-data compatibility to achieve optimal performance and reliable predictive capabilities.

Fig. 1
Gradient boosting model (GBM): relative feature importance for having versus not having private dental insurance, determined using Shapley additive explanations (SHAP)

Table 1
Baseline characteristics of CCHS data

Table 2
Selection of optimal machine learning models based on the model evaluation metrics. Overfitting: occurs when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. The model performs exceptionally well on the training data but poorly on the test data or new, unseen data